A place to capture information about the various SNP databases that exist. Some are very leading edge, maintained as part of sequencing test analysis; others are part of the slower, more formal refereed-paper process. Eventually, all should feed into the NIH rsID master database (dbSNP). Also included are sites and tools to convert and look up SNPs. The first table covers yDNA-only databases, since those represent the majority. The second table covers generic SNP databases across all DNA.
Most file formats define the chromosome and the location within the chromosome. Some then add either an "rsID" identifier (from dbSNP, listed below) or, in the case of yDNA, a SNP Name. In a few cases, the files carry some other, internal company identifier, but the chromosome and location can still be used to find the equivalent "rsID". Often, these identifiers were defined before submission to the database and the formats were never updated. The NGG file format for mtDNA and yDNA specifically uses only the SNP Name and no other identification (no "rsID" and no location; the chromosome / DNA type is known from the context of the enclosing file).
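As a rough illustration of how chromosome and location can stand in for a missing "rsID", here is a minimal sketch of a position-keyed lookup. The index contents, the `lookup_rsid` and `normalize_chrom` names, and the placeholder rsIDs are all assumptions made up for this example; a real index would be built from a dbSNP download, and the positions must be on the same build (Build37 or Build38) as the file being converted.

```python
from typing import Dict, Optional, Tuple

# (chromosome label, 1-based position) -> rsID.
# Hypothetical structure for this sketch, not a real dbSNP API.
PositionIndex = Dict[Tuple[str, int], str]


def normalize_chrom(chrom: str) -> str:
    """Reduce labels such as 'chrY', 'Y', 'chr1' or '1' to a bare upper-case form."""
    label = chrom.strip()
    if label.lower().startswith("chr"):
        label = label[3:]
    return label.upper()


def lookup_rsid(index: PositionIndex, chrom: str, pos: int) -> Optional[str]:
    """Return the rsID recorded at this chromosome/position, or None if unknown."""
    return index.get((normalize_chrom(chrom), pos))


# Placeholder entries only; these are not real dbSNP records.
example_index: PositionIndex = {
    ("1", 1000): "rs0000001",
    ("Y", 2000): "rs0000002",
}

print(lookup_rsid(example_index, "chrY", 2000))
print(lookup_rsid(example_index, "chr1", 999))
```

A dictionary keyed on the (chromosome, position) pair keeps the lookup independent of whatever internal identifier the vendor file happens to use.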
Not really a SNP database or tool, but MapS Converter allows base-pair locus conversion between Build36 and Build37, and length conversion between base-pair start/stop positions and centiMorgans.
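To make the base-pair to centiMorgan idea concrete, here is a minimal sketch that linearly interpolates over a genetic map. This is a common generic approach and is only an assumption here; it is not claimed to be how MapS Converter works internally. The `toy_map` anchor points, `bp_to_cm` and `segment_length_cm` are hypothetical names and values invented for the example.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical genetic-map anchors for one chromosome:
# (base-pair position, cumulative centiMorgans), sorted by position.
GeneticMap = List[Tuple[int, float]]


def bp_to_cm(gmap: GeneticMap, pos: int) -> float:
    """Linearly interpolate the cumulative cM value at a base-pair position."""
    positions = [bp for bp, _ in gmap]
    if pos <= positions[0]:
        return gmap[0][1]          # clamp before the first anchor
    if pos >= positions[-1]:
        return gmap[-1][1]         # clamp after the last anchor
    i = bisect_right(positions, pos)   # first anchor strictly to the right
    (bp0, cm0), (bp1, cm1) = gmap[i - 1], gmap[i]
    return cm0 + (cm1 - cm0) * (pos - bp0) / (bp1 - bp0)


def segment_length_cm(gmap: GeneticMap, start: int, stop: int) -> float:
    """Genetic length of a segment given by base-pair start/stop."""
    return bp_to_cm(gmap, stop) - bp_to_cm(gmap, start)


# Made-up anchors, for illustration only.
toy_map: GeneticMap = [(0, 0.0), (1_000_000, 1.0), (3_000_000, 2.0)]

print(segment_length_cm(toy_map, 500_000, 2_000_000))
```

Converting loci between Build36 and Build37 is a different problem that needs a coordinate-mapping table between the two builds, so it is not sketched here.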
Note 1: The number of actual, curated dbSNP entries is reported to be around 100 million (compare the TOTAL row of the GRCh38 table below).
DB | Site | SNP Prefix | Notes |
YBrowse DB | ISOGG | | By Thomas Krahn (DNA Fingerprint / FTDNA, YSEQ); contains SNP names and locations |
SNP Index | ISOGG | | For SNPs placed in their tree |
BY SNP Index @ ISOGG | FTDNA | BY | Complete list, including early-named FTDNA SNPs before full confirmation (mostly since the BY500 test was introduced) (227,533 entries as of Dec 2018) |
FT SNP Index @ ISOGG | FTDNA | FT | New list of additional SNPs (mostly since the BY700 test was introduced). Oddly, it claims to replace the BY list but is a much smaller list (92,626 entries as of Jul 2019) |
FGC SNP Index @ ISOGG | FGC | FGC | Ditto for FGC |
SNP-List | yFull | Y | Very centric on their own nomenclature; difficult to find comparable names already identified in other DBs |
Mutations | yTree | - | List is too short to be a complete list of the variants they cover |
Variants | Haplogroup-R | HR | |
DB | Site | SNP Prefix | Notes |
dbSNP | NIH | rs | Home of the rsIDs (in both Build37 and Build38), SNP names, and genes(?) |
SNPedia | | | Same people behind the Promethease site (see also MIT Tech Review article) |
SNP-Nexus | | | |
Ensembl and BioMart | EMBL-EBI | | More than just the human genome |
European Variation Archive | | | More than just the human genome |
- See Microarray Databases on Wikipedia for a list of curated entry databases.
- openSNP, a public database of user-submitted, unreviewed data
- 1000 Genomes Project archive (now the International Genome Sample Resource), replicated at NIH and EBI
- ENA (EBI) xxxxxx public submission database
To give some sense of content, here are figures for some of these databases:
yBrowse snps_hg38.csv file dated 2 Aug 2020
# Entries: 1,263,058 (with 7,433 of them marked InDels)
# FT Entries: 362,547 (whereas the corresponding FTDNA FT Spreadsheet has 362,012)
# BY Entries: 264,697 (whereas the corresponding FTDNA BY Spreadsheet has 230,312)
# Y Entries: 195,034 (whereas yFull.com/snp-list as of 2 Aug 2020 has 205,115)
# YP Entries: 7,113 (whereas yFull.com/yp/snp-list as of 2 Aug 2020 has 6,442)
# FGC Entries: 125,598 (whereas the corresponding FGC spreadsheet has 65,536)
It is not clear how aliases for the same SNP (that is, the same location with multiple names) are being handled. It is also not clear how many entries in yBrowse are STRs, which it captures only minimally. Other sources indicate 15,000+ total STRs, with the NGS analysis groups FTDNA and yFull both having under 1,000 not yet defined in their analysis results.
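One way to explore both questions (counts per naming prefix, and names that share a location) is sketched below. The record layout of (SNP name, position) is an assumption for illustration; the real snps_hg38.csv columns may differ, and the sample names and positions are invented. Note that grouping by position alone can over-count aliases, since different mutations can occur at the same position.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Each record is (snp_name, position). Hypothetical layout for this sketch.
Record = Tuple[str, int]

PREFIX_RE = re.compile(r"^([A-Za-z]+)")


def name_prefix(name: str) -> str:
    """Leading letters of a SNP name, so 'Y' and 'YP' stay distinct."""
    match = PREFIX_RE.match(name)
    return match.group(1).upper() if match else ""


def count_by_prefix(records: Iterable[Record]) -> Counter:
    """Number of entries per naming prefix (BY, FT, Y, YP, FGC, ...)."""
    return Counter(name_prefix(name) for name, _ in records)


def aliases_by_position(records: Iterable[Record]) -> Dict[int, List[str]]:
    """Positions that carry more than one SNP name (candidate aliases)."""
    names_at: Dict[int, List[str]] = defaultdict(list)
    for name, pos in records:
        names_at[pos].append(name)
    return {pos: names for pos, names in names_at.items() if len(names) > 1}


# Invented sample rows, not taken from any real database.
sample: List[Record] = [("BY101", 500), ("FT202", 500), ("Y303", 700), ("YP404", 800)]

print(count_by_prefix(sample))
print(aliases_by_position(sample))
```

Taking the whole leading run of letters as the prefix is what keeps the Y and YP series apart; matching on the first letter alone would merge them.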
dbSNP reference GRCh38 release, from file GCF_000001405.38.gz dated 14 May 2020
SN | # Entries | # InDels |
NC_000001.11 (chr1) | 54,954,827 | 3,490,824 |
NC_000002.12 (chr2) | 58,785,283 | 3,732,571 |
NC_000003.12 ... | 48,086,717 | 3,022,655 |
NC_000004.12 | 46,216,186 | 2,947,022 |
NC_000005.10 | 43,329,407 | 2,739,837 |
NC_000006.12 | 40,561,773 | 2,651,154 |
NC_000007.14 | 38,926,341 | 2,535,153 |
NC_000008.11 | 36,812,460 | 2,210,783 |
NC_000009.12 | 30,541,428 | 1,865,518 |
NC_000010.11 | 32,443,993 | 2,076,655 |
NC_000011.10 | 33,240,682 | 2,047,744 |
NC_000012.12 | 32,136,494 | 2,100,312 |
NC_000013.11 | 23,659,495 | 1,538,386 |
NC_000014.9 | 21,616,309 | 1,403,950 |
NC_000015.10 | 20,228,514 | 1,322,963 |
NC_000016.10 | 22,226,258 | 1,337,067 |
NC_000017.11 | 19,759,284 | 1,345,501 |
NC_000018.10 | 18,734,309 | 1,188,677 |
NC_000019.10 | 15,192,343 | 1,070,367 |
NC_000020.11 | 15,416,556 | 976,361 |
NC_000021.9 | 9,241,876 | 614,928 |
NC_000022.11 (chr22) | 9,625,604 | 639,800 |
NC_000023.11 (X) | 27,812,735 | 1,713,673 |
NC_000024.10 (Y) | 1,665,288 | 107,759 |
TOTAL | 701,214,162 (see note 1) | 44,679,660 |
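The sequence names in the table follow a simple pattern (NC_000001 through NC_000022 for chr1 to chr22, then NC_000023 for X and NC_000024 for Y), so they can be translated to chromosome labels mechanically. The sketch below shows one way to do that, plus a helper for the InDel share of a row; the function names are made up for this example.

```python
import re

# Matches accessions such as 'NC_000023.11'; the two captured digits
# are the chromosome number used in the table above.
ACCESSION_RE = re.compile(r"^NC_0000(\d{2})\.\d+$")


def accession_to_chrom(accession: str) -> str:
    """Map e.g. 'NC_000023.11' to 'X', following the numbering in the table."""
    match = ACCESSION_RE.match(accession.strip())
    if not match:
        raise ValueError(f"not a recognised chromosome accession: {accession!r}")
    number = int(match.group(1))
    if 1 <= number <= 22:
        return str(number)
    if number == 23:
        return "X"
    if number == 24:
        return "Y"
    raise ValueError(f"accession number out of range: {accession!r}")


def indel_share(entries: int, indels: int) -> float:
    """InDels as a percentage of all entries in a row."""
    return 100.0 * indels / entries


print(accession_to_chrom("NC_000024.10"))
print(round(indel_share(1_665_288, 107_759), 2))
```

Anything that does not fit the pattern, such as a non-chromosome sequence, raises an error rather than being guessed at.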
See Also
Microarray File Formats, Sequencing File Formats
External References
- General SNP Development blog post by Roberta Estes, written on the occasion of FTDNA adding 100,000 SNPs to their Y SNP list
- Y SNP Prefix Code List by Rebekah Canada